-
Notifications
You must be signed in to change notification settings - Fork 680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] fix ArrayNode state's TaskPhase reset #5451
Conversation
Signed-off-by: Paul Dittamo <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5451 +/- ##
=======================================
Coverage 60.91% 60.92%
=======================================
Files 796 796
Lines 51689 51689
=======================================
+ Hits 31488 31493 +5
+ Misses 17294 17289 -5
Partials 2907 2907
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
This reverts commit 9f4306a. Signed-off-by: Paul Dittamo <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
|
||
// if the ArrayNode phase has changed we need to reset the taskPhaseVersion to 0 | ||
if currentArrayNodePhase != arrayNodeState.Phase { | ||
arrayNodeState.TaskPhaseVersion = 0 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't setting this afterwards cause problems with admin requiring incremental values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scenario of interest would be if we would lose some eventing data if we have incrementPhase = False and currentArrayNodePhase != arrayNodeState.Phase + have a previous event emitted with the same TaskPhase and TaskPhaseVersion.
We set incrementTaskPhaseVersion = True if there's a subnode phase change. The arrayNodeState.Phase is updated in 3 different places: arrayNodeState.Phase = v1alpha1.ArrayNodePhaseFailing, arrayNodeState.Phase = v1alpha1.ArrayNodePhaseSucceeding, and arrayNodeState.Phase = v1alpha1.ArrayNodePhaseExecuting.
For arrayNodeState.Phase = v1alpha1.ArrayNodePhaseFailing and arrayNodeState.Phase = v1alpha1.ArrayNodePhaseSucceeding, there would have to be subnode phase change so we couldn't have a scenario where incrementPhase = False and currentArrayNodePhase != arrayNodeState.Phase.
For arrayNodeState.Phase = v1alpha1.ArrayNodePhaseExecuting, we shouldn't have a previous event emitted with the same TaskPhase and TaskPhaseVersion as we should only be in the v1alpha1.ArrayNodePhaseNone phase for the first pass through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proceeding loops would have a new TaskPhase as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scenario of interest would be if we would lose some eventing data if we have incrementPhase = False and currentArrayNodePhase != arrayNodeState.Phase + have a previous event emitted with the same TaskPhase and TaskPhaseVersion.
Do we have a repro for this? I'm having difficulty understanding how this is possible. IfincrementPhase == False
then no subnode phases have been updated socurrentArrayNodePhase != arrayNodeState.Phase
cannot be true since they are determined from subNode phases - unless this is a system failure in the handler logic?
Then taskPhase
is deterministic on currentArrayNodePhase
, so if currentArrayNodePhase != arrayNodeState.Phase
then we can not have emitted an event with the same taskPhase
and taskPhaseVersion
. FlyteAdmin does not take taskPhaseVersion
into accord for new taskPhase
values.
I'm probably missing something here, a repro would help. Or is there an issue this should be linked to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hamersaw I noticed this when working on mapping over launch plans but don't think I actually ran into the bug. However this should repro an issue.
@task
def hello(num: int) -> int:
if num > 9:
time.sleep(10)
raise Exception("This is a test exception")
return num
@workflow
def map_workflow():
map_task(hello, min_success_ratio=0.5)(num=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
A subnode task phase will be stuck in "Running" even though it has terminated/failed.
Emitting the event always fails due to the task phase not getting bumped but the task phase version getting reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, this doesn't occur when the staggered subtask succeeds. Looking back into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured it out:
this doesn't bubble up for:
@task
def hello(num: int) -> int:
if num > 9:
time.sleep(10)
return num
@workflow
def map_workflow():
map_task(hello, min_success_ratio=0.5)(num=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
flyte/flytepropeller/pkg/controller/nodes/array/handler.go
Lines 374 to 386 in fc1c92c
for _, nodePhaseUint64 := range arrayNodeState.SubNodePhases.GetItems() { | |
nodePhase := v1alpha1.NodePhase(nodePhaseUint64) | |
switch nodePhase { | |
case v1alpha1.NodePhaseSucceeded, v1alpha1.NodePhaseRecovered, v1alpha1.NodePhaseSkipped: | |
successCount++ | |
case v1alpha1.NodePhaseFailing: | |
failingCount++ | |
case v1alpha1.NodePhaseFailed, v1alpha1.NodePhaseTimedOut: | |
failedCount++ | |
default: | |
runningCount++ | |
} | |
} |
flyte/flytepropeller/pkg/controller/nodes/array/handler.go
Lines 403 to 409 in fc1c92c
if len(arrayNodeState.SubNodePhases.GetItems())-failedCount < minSuccesses { | |
// no chance to reach the minimum number of successes | |
arrayNodeState.Phase = v1alpha1.ArrayNodePhaseFailing | |
} else if successCount >= minSuccesses && runningCount == 0 { | |
// wait until all tasks have completed before declaring success | |
arrayNodeState.Phase = v1alpha1.ArrayNodePhaseSucceeding | |
} |
Since we don't do that we bump task phase version instead of resetting to 0 to the event lands in admin. Meanwhile for NodePhaseFailing, we don't increment running so we update arrayNodeState.Phase which then leads for the existing implementation to reset the task phase version.
Signed-off-by: Paul Dittamo <[email protected]>
* set updated array node phase for task exec event Signed-off-by: Paul Dittamo <[email protected]> * Revert "set updated array node phase for task exec event" This reverts commit 9f4306a. Signed-off-by: Paul Dittamo <[email protected]> * reset taskPhaseVersion due to nodephase change after emitting event Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> Signed-off-by: Vladyslav Libov <[email protected]>
Tracking issue
Resolves: #5201
Why are the changes needed?
When a NodePhase change is detected, the current implementation resets the TaskPhaseVersion to 0 prior to emitting the event. This would create a duplicate event since the TaskPhase isn't also updated causing for the event to not get emitted/persisted to admin.
What changes were proposed in this pull request?
We opt to do this instead of bumping the task phase as that can cause issues.
Example issue:
Also this seems to be consistent with letting the proceeding propeller loop handling the next state.
How was this patch tested?
Ran through different failing and succeeding scenarios and ensured that the subnodes phases transitioned to the correct terminal phase.
Setup process
Repro issues:
Run this multiple times:
Screenshots
Check all the applicable boxes
Related PRs
Docs link